This is to certify that the work I am submitting is my own. All external references and sources are clearly acknowledged and identified within the contents. I am aware of the University of Warwick regulation concerning plagiarism and collusion.
No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done an appropriate reduction in the mark I might otherwise have received will be made.
This report is divided into four sections as listed below:
Section 1.1 - Food Standards Agency Interventions - Technical Development
Section 1.2 - Food Standards Agency Interventions - Professional Report
Section 2.1 - Book Sales - Technical Development
Section 2.2 - Book Sales - Professional Report
This analysis of the Food Standards Agency data addresses the following requirements from the agency's managers:
1.1.1 Visualise distributions across the Local Authorities (LAs) of the % of enforcement actions successfully achieved - for all impact levels combined.
1.1.2 Visualise distributions across the Local Authorities (LAs) of the % of enforcement actions successfully achieved - for all impact levels separately - A to E.
1.1.3 Determine whether employing more employees increases the likelihood of establishments successfully responding to enforcement actions.
1.1.4 Determine whether there is a relationship between the proportion of successful responses and the number of FTE food safety employees in each local authority.
1.1.5 Determine whether there is a relationship between the proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority.
The data dictionary below is based on the supporting sources provided for Question 1. Note that only the features relevant to the analysis are described:
| Feature | Description |
|---|---|
| Country | Country of Local Authority |
| LAType | Local Authority Type |
| LAName | Local Authority Name |
| Total_Establishments | Total number of establishments in the area |
| Percent_Compliant_A_E | Total % of Establishments that are broadly compliant with the regulations and are rated from category A to E |
| A_Rated | Number of Establishments that are rated A in the area |
| Percent_Compliant_A | Total % of establishments that are broadly compliant and rated A |
| B_Rated | Number of establishments that are rated B in the area |
| Percent_Compliant_B | Total % of establishments that are broadly compliant and rated B |
| C_Rated | Number of establishments that are rated C in the area |
| Percent_Compliant_C | Total % of establishments that are broadly compliant and rated C |
| D_Rated | Number of establishments that are rated D in the area |
| Percent_Compliant_D | Total % of establishments that are broadly compliant and rated D |
| E_Rated | Number of establishments that are rated E in the area |
| Percent_Compliant_E | Total % of establishments that are broadly compliant and rated E |
| Interventions_A_E | Total % of Interventions that have taken place for all impact levels A to E |
| Interventions_A | Total % of Interventions that have taken place for premises rated A |
| Interventions_B | Total % of Interventions that have taken place for premises rated B |
| Interventions_C | Total % of Interventions that have taken place for premises rated C |
| Interventions_D | Total % of Interventions that have taken place for premises rated D |
| Interventions_E | Total % of Interventions that have taken place for premises rated E |
| Employees | Number of Professional Full Time Posts that are currently occupied in the Local Authority Office |
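The original omits its setup chunk. The package list below is a sketch inferred from the functions called later in the report (`read_csv`, the pipe and dplyr verbs, `ggplot`, `rcorr`, `emmeans`, `vif`, `grid.arrange`); it is an assumption, not part of the submitted code:

```r
# Packages assumed by the analysis below (inferred from the functions used)
library(tidyverse)   # read_csv, dplyr verbs, ggplot2
library(Hmisc)       # rcorr() correlation matrices
library(emmeans)     # estimated marginal means and pairwise contrasts
library(car)         # vif() multicollinearity check
library(gridExtra)   # grid.arrange() multi-panel figures
```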
food_data <- read_csv("2019-20-enforcement-data-food-hygiene.csv", guess_max = 500, show_col_types = FALSE)
# Note - The imported dataset has 353 observations, matching with the provided input .csv file
# Basic data checks on the imported data set
# summary check
summary(food_data)
# structure check
str(food_data)
The above data exploration shows that several columns were imported as character rather than numeric types and that some rows contain missing values; the imported data set also has extra columns that are not needed for the required analysis. These issues are addressed below.
Data Cleaning Steps:
Data Cleaning Step 1: Renaming the columns so they can be handled efficiently in the R environment:
# Utilising the 'colnames' function to rename the Column names
colnames(food_data) <-c(
"Country","LAType","LAName","Total_Establishments","NR_For_Intervention",
"Outside_Programme","Percent_Compliant_A_E","Percent_Compliant_Inc_NR",
"A_Rated","Percent_Compliant_A","B_Rated","Percent_Compliant_B","C_Rated",
"Percent_Compliant_C","D_Rated","Percent_Compliant_D","E_Rated","Percent_Compliant_E",
"Interventions_A_E","Interventions_A","Interventions_B","Interventions_C","Interventions_D",
"Interventions_E","Interventions_NR","Volun_Closure","Seizure",
"Suspension_Revocation_Licence","Hygiene_Emergency_Pro_Notice",
"Pro_Order","Caution","Hygiene_Notices","Detention_Notices",
"Written_Warning","Prosecution_Concluded","Employees")
Data Cleaning Step 2: Identifying and removing rows with missing values, using `Interventions_A` to flag the NAs.
food_data <- food_data %>%
filter(!is.na(Interventions_A))
# After removing missing values, we've 347 rows (=353-6)
Data Cleaning Step 3: Correcting the data types of Percent_Compliant_A and Interventions_A
# Char data type in 'Percent_Compliant_A' because there are 52 entries below with "NP"
food_data %>%
dplyr::group_by(Percent_Compliant_A) %>%
dplyr::summarise(n = n()) %>%
arrange(desc(n))
# Char data type in 'Interventions_A' because there are 24 entries below with "NR"
food_data %>%
dplyr::group_by(Interventions_A) %>%
dplyr::summarise(n = n()) %>%
arrange(desc(n))
# Updating "NP" and "NR" to 0 using dplyr mutate function:
food_data <- food_data %>%
mutate(Percent_Compliant_A = if_else(Percent_Compliant_A == 'NP','0',Percent_Compliant_A),
Interventions_A = if_else(Interventions_A == 'NR','0',Interventions_A))
# Updating the data type of Percent_Compliant_A, Interventions_A, and LAType
food_data <- food_data %>%
mutate(Percent_Compliant_A = as.numeric(Percent_Compliant_A),
Interventions_A = as.numeric(Interventions_A),
LAType = as.factor(LAType))
# Checking summary and str to see if all the updates are made correctly:
summary(food_data)
str(food_data)
Visually checking the correlation between Employees and Interventions A-E:
emp_est_plot <- ggplot(food_data, aes(y=Interventions_A_E, x=Employees)) + geom_point() + labs(x="Employees", y="Interventions A-E", subtitle="The shaded area shows the 95% CI for the best-fitting regression line (r = -0.02)", title = "Variation in Interventions A-E as a function of Number of Employees") + geom_smooth(method=lm)
emp_est_plot
## `geom_smooth()` using formula 'y ~ x'
As seen from the figure above, there is essentially no correlation between Employees and Interventions A-E. This is confirmed by the correlation coefficient and p-value below:
# Checking the correlation between Employees and Interventions_A_E
rcorr(
food_data$Employees,
food_data$Interventions_A_E
)
## x y
## x 1.00 -0.02
## y -0.02 1.00
##
## n= 347
##
##
## P
## x y
## x 0.6552
## y 0.6552
Fitting a linear model (lm) to determine the relationship between Interventions (A-E) and Employees:
lm_intervention_employees <- lm(Interventions_A_E ~ Employees, food_data)
summary(lm_intervention_employees)
##
## Call:
## lm(formula = Interventions_A_E ~ Employees, data = food_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.304 -4.575 4.067 8.658 13.860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.1091 1.2828 67.905 <2e-16 ***
## Employees -0.1195 0.2675 -0.447 0.655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.4 on 345 degrees of freedom
## Multiple R-squared: 0.0005787, Adjusted R-squared: -0.002318
## F-statistic: 0.1998 on 1 and 345 DF, p-value: 0.6552
# Confirming the above results via anova()
anova(lm_intervention_employees)
## Analysis of Variance Table
##
## Response: Interventions_A_E
## Df Sum Sq Mean Sq F value Pr(>F)
## Employees 1 31 30.703 0.1998 0.6552
## Residuals 345 53021 153.683
As evident from the linear model summary and ANOVA above, there is no evidence that employing more employees increases the likelihood of establishments successfully responding to enforcement actions: the correlation is negligible (r = -0.02) and the Employees coefficient is not significant, t(345) = -0.447, p = 0.655.
This section of the report aims to provide the conclusions of the technical development and answer the requirements of the politicians and managers of the food agency.
In order to determine whether employing more employees increases the likelihood of establishments successfully responding to enforcement actions, we fitted a linear model. As seen from the figure below, there is essentially no correlation between employees and successful responses to enforcement actions. The linear model shows that Employees is not a significant predictor of successful response to enforcement actions, t(345) = -0.447, p = 0.655.
Therefore, it is unlikely that employing more employees increases the likelihood of establishments successfully responding to enforcement actions.
This analysis of the Book Sales data addresses the following requirements from the managers of a publishing company:
2.1.1 Do books from different genres have different daily sales on average?
2.1.2 Do books have more/fewer sales depending upon their average review scores and total number of reviews?
2.1.3 What is the effect of sale price upon the number of sales, and is this different across genres?
The data dictionary below is based on the description provided for Question 2:
| Feature | Description |
|---|---|
| sold.by | Name of the book seller |
| publisher.type | Type of publisher, i.e. amazon, big five, indie, single author, small/medium |
| genre | Genre of the book, i.e. childrens, fiction and non-fiction |
| avg.review | Average review score of the book, out of 5 |
| daily.sales | Average number of sales (minus refunds) across all days in the period |
| total.reviews | Total number of reviews for the particular book |
| sale.price | Average price at which the book sold across all sales in the period |
The column "sold by" was renamed to "sold.by" in the input CSV file, to avoid run-time errors in the R environment.
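Editing the raw CSV by hand is fragile; the same fix can be done reproducibly in R with `dplyr::rename()`. A minimal sketch on toy data (the space-containing column name `sold by` is the assumption here):

```r
library(dplyr)

# Toy stand-in for the imported sales data, with the space-containing column
raw <- tibble::tibble(
  `sold by`   = c("Amazon Digital Services, Inc.", "Hachette"),
  daily.sales = c(55.4, 106.0)
)

# rename(new = old): back-ticks let us refer to the awkward original name
clean <- raw %>% rename(sold.by = `sold by`)

names(clean)
```

Renaming in code keeps the raw file untouched, so the analysis can be re-run on a fresh download of the data.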
sales_data <- read_csv("publisher_sales.csv", guess_max = 10000, show_col_types = FALSE)
# Note - no. of rows and columns imported are checked against the csv file
Summary statistics:
book_data_summary <- sales_data %>%
dplyr::group_by(genre) %>%
dplyr::summarize(mean=mean(daily.sales),median = median(daily.sales),n = n())
book_data_summary
## # A tibble: 3 × 4
## genre mean median n
## <chr> <dbl> <dbl> <int>
## 1 childrens 55.6 55.4 2000
## 2 fiction 106. 106. 2000
## 3 non_fiction 75.9 75.9 2000
Visualising daily sales by genre with a box plot:
ggplot(sales_data, aes(x=genre, y = daily.sales))+
geom_boxplot()+
labs(title="Daily Sales by Genre", x="Genre", y="Daily Sales")
The following modifications were made to the imported data set to make it suitable for the analysis:

- The single data entry in the non_fiction genre with daily sales of -0.53 was updated to 0.
- All character columns were converted to factors.
# Updating daily sales value
sales_data <- sales_data %>%
mutate(daily.sales = ifelse(daily.sales<0,0,daily.sales))
# Updating the data type
# generating a vector to keep the column names
columns <- c("sold.by", "publisher.type", "genre")
# Set the correct measurement levels or data types
sales_data[columns] <- lapply(sales_data[columns], as.factor)
# Checking if the correct data type corrections are made or not
str(pull(sales_data,sold.by))
## Factor w/ 13 levels "Amazon Digital Services, Inc.",..: 11 1 1 1 13 13 1 6 1 1 ...
str(pull(sales_data,publisher.type))
## Factor w/ 5 levels "amazon","big five",..: 2 3 5 5 2 2 5 2 5 5 ...
str(pull(sales_data,genre))
## Factor w/ 3 levels "childrens","fiction",..: 1 3 3 2 1 1 2 1 2 1 ...
# all character data types are converted to factor
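The `lapply()` assignment above works; a more idiomatic dplyr equivalent converts every character column in one step with `across()`. A sketch on toy data:

```r
library(dplyr)

# Toy data with the same column types as the imported sales data
toy <- tibble::tibble(
  sold.by     = c("Amazon Digital Services, Inc.", "Hachette"),
  genre       = c("fiction", "childrens"),
  daily.sales = c(106, 55.4)
)

# Convert every character column to a factor in a single mutate()
toy <- toy %>% mutate(across(where(is.character), as.factor))

sapply(toy, class)
```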
Creating the data set for requirement 2.1.1:
# Filtering out genre and daily.sales
daily_sales_by_genre <- sales_data %>%
select(genre,daily.sales)
# lm model for predicting daily sales based on genre
lm_daily_sales_genre <- lm(daily.sales~genre, data=daily_sales_by_genre)
Utilising the anova function to test whether genre has a significant effect on the daily sales.
anova(lm_daily_sales_genre)
## Analysis of Variance Table
##
## Response: daily.sales
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 2 2562524 1281262 2590.6 < 2.2e-16 ***
## Residuals 5997 2966052 495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Using the NHST approach
summary(lm_daily_sales_genre)
##
## Call:
## lm(formula = daily.sales ~ genre, data = daily_sales_by_genre)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.396 -13.326 -0.076 13.249 102.094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.5773 0.4973 111.76 <2e-16 ***
## genrefiction 50.3087 0.7033 71.53 <2e-16 ***
## genrenon_fiction 20.2888 0.7033 28.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.24 on 5997 degrees of freedom
## Multiple R-squared: 0.4635, Adjusted R-squared: 0.4633
## F-statistic: 2591 on 2 and 5997 DF, p-value: < 2.2e-16
Marginal means:
lm_daily_sales_genre_emm <- emmeans(lm_daily_sales_genre, ~genre)
lm_daily_sales_genre_emm
## genre emmean SE df lower.CL upper.CL
## childrens 55.6 0.497 5997 54.6 56.6
## fiction 105.9 0.497 5997 104.9 106.9
## non_fiction 75.9 0.497 5997 74.9 76.8
##
## Confidence level used: 0.95
The summary() approach above does not directly give the difference between the fiction and non_fiction genres. Therefore we:

- use the pairs() function to test whether the differences between genres are significant; and
- use the contrast output in conjunction with confint() to calculate the mean differences with their associated confidence intervals.
lm_daily_sales_genre_pairs <- pairs(lm_daily_sales_genre_emm)
lm_daily_sales_genre_pairs
## contrast estimate SE df t.ratio p.value
## childrens - fiction -50.3 0.703 5997 -71.535 <.0001
## childrens - non_fiction -20.3 0.703 5997 -28.849 <.0001
## fiction - non_fiction 30.0 0.703 5997 42.686 <.0001
##
## P value adjustment: tukey method for comparing a family of 3 estimates
As evident from the Tukey-adjusted comparisons above, the difference between each pair of genres is significant (p < 0.0001).
lm_daily_sales_genre_contrast <- confint(pairs(lm_daily_sales_genre_emm))
lm_daily_sales_genre_contrast
## contrast estimate SE df lower.CL upper.CL
## childrens - fiction -50.3 0.703 5997 -52.0 -48.7
## childrens - non_fiction -20.3 0.703 5997 -21.9 -18.6
## fiction - non_fiction 30.0 0.703 5997 28.4 31.7
##
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
# plots for comparison
p.gain <- ggplot(summary(lm_daily_sales_genre_emm), aes(x=genre, y=emmean, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() + labs(x="Genre", y="Average Daily Sales", subtitle="Error Bars are Extent of 95% CIs", title = "Distribution across Genre")
p.contrasts <- ggplot(lm_daily_sales_genre_contrast, aes(x=contrast, y=estimate, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() +labs(x="Genre - Contrast", y="Average Daily Sales", subtitle="Error Bars are Extent of 95% CIs",title = "Distribution - Genre Contrast")+scale_x_discrete(guide = guide_axis(n.dodge = 2))
grid.arrange(p.gain, p.contrasts, ncol=2)
Visualising the distribution of daily sales across genres under the null hypothesis model and the alternative hypothesis model.
Step 1 - Arguments for Null Hypothesis and Alternative Hypothesis model
# Arguments for Null Hypothesis Model
m.daily.sales.intercept <- lm(daily.sales~1, data=daily_sales_by_genre)
null.mean <- coef(m.daily.sales.intercept)
null.sd <- sigma(m.daily.sales.intercept)
# Arguments for Alternative Hypothesis Model - Running lm model and calculating emmeans
alternative.means <- summary(lm_daily_sales_genre_emm)$emmean
alternative.sd <- sigma(lm_daily_sales_genre)
Step 2 - Visually Comparing Null Hypothesis and Alternative Hypothesis model.
# Checking the Null Hypothesis Distribution:
null_plot <- ggplot(daily_sales_by_genre,aes(x=daily.sales, fill = genre))+
geom_histogram(aes(y=after_stat(density)), position="identity", alpha=0.3, binwidth=1)+
stat_function(fun=dnorm, args=list(mean=null.mean, sd=null.sd)) + labs(x="Daily Sales", y="Density", title="Null Hypothesis", fill="Genre")
# Checking the Alternative Hypothesis Distribution:
colours <- scales::hue_pal()(3)
alt_plot <- ggplot(daily_sales_by_genre, aes(x=daily.sales, fill=genre))+
geom_histogram(aes(y=after_stat(density)), position="identity", alpha=0.3, binwidth=1)+
labs(x="Daily Sales", y="Density", title="Alternative Hypothesis", fill="Origin") +
stat_function(fun=dnorm, args=list(mean=alternative.means[1], sd=alternative.sd), col=colours[1]) +
stat_function(fun=dnorm, args=list(mean=alternative.means[2], sd=alternative.sd), col=colours[2]) +
stat_function(fun=dnorm, args=list(mean=alternative.means[3], sd=alternative.sd), col=colours[3])
grid.arrange(null_plot, alt_plot, ncol=1)
The plots above confirm our finding that the three genres differ in their means; the alternative hypothesis model therefore describes the observed distribution better than the null model.
For this requirement we need to perform a multiple linear regression. We first get an overview of the underlying data by performing the following steps:

1. Evaluate the correlation matrix.
2. Assess the individual effect of average review scores on daily sales.
3. Assess the individual effect of total reviews on daily sales.
4. Run the multiple regression, determining main effects and interaction effects.
Checking the data distribution of daily.sales with respect to total reviews and average reviews
sales_tot_rev_plot <- ggplot(sales_data, aes(y=daily.sales, x=total.reviews)) + geom_point() + labs(x="Total Reviews", y="Daily Sales", subtitle="The shaded area shows the 95% CI for the best-fitting regression line", title = "Daily Sales vs Total Reviews") + geom_smooth(method=lm)
sales_avg_rev_plot <- ggplot(sales_data, aes(y=daily.sales, x=avg.review)) + geom_point() + labs(x="Average Reviews", y="Daily Sales", subtitle="The shaded area shows the 95% CI for the best-fitting regression line", title = "Daily Sales vs Average Reviews") + geom_smooth(method=lm)
sales_dist <- grid.arrange(sales_tot_rev_plot, sales_avg_rev_plot, ncol=1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
As seen in the figure above, daily sales are positively correlated with total reviews and essentially uncorrelated with average reviews.
Creating a data frame for daily sales, average reviews, and total reviews
# Selecting sales and avg review and total reviews columns
sales_rev_df <- sales_data %>%
select(daily.sales,avg.review,total.reviews)
Checking the correlation matrix to see the correlation between each pair of variables and a p-value:
sales_corr_matrix <- rcorr(as.matrix(sales_rev_df))
sales_corr_matrix
## daily.sales avg.review total.reviews
## daily.sales 1.00 0.0 0.66
## avg.review 0.00 1.0 0.10
## total.reviews 0.66 0.1 1.00
##
## n= 6000
##
##
## P
## daily.sales avg.review total.reviews
## daily.sales 0.747 0.000
## avg.review 0.747 0.000
## total.reviews 0.000 0.000
Determining the individual effect of Average Reviews and Total Reviews on daily sales:
First, the effect of average review scores on daily sales via the lm() function:
# Linear model for daily sales vs avg reviews
lm_sales_avg_rev <- lm(daily.sales~avg.review, data = sales_rev_df)
summary(lm_sales_avg_rev)
##
## Call:
## lm(formula = daily.sales ~ avg.review, data = sales_rev_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.414 -22.299 -4.837 18.943 128.948
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.0533 2.9509 27.128 <2e-16 ***
## avg.review -0.2211 0.6854 -0.323 0.747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.36 on 5998 degrees of freedom
## Multiple R-squared: 1.735e-05, Adjusted R-squared: -0.0001494
## F-statistic: 0.1041 on 1 and 5998 DF, p-value: 0.747
# 95% CI for average reviews
cbind(coefficient=coef(lm_sales_avg_rev), confint(lm_sales_avg_rev))
## coefficient 2.5 % 97.5 %
## (Intercept) 80.0533475 74.268426 85.838269
## avg.review -0.2211249 -1.564779 1.122529
Second, the effect of total reviews on daily sales via the lm() function:
# Linear model for daily sales vs total reviews
lm_sales_tot_rev <- lm(daily.sales~total.reviews, data = sales_rev_df)
summary(lm_sales_tot_rev)
##
## Call:
## lm(formula = daily.sales ~ total.reviews, data = sales_rev_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.203 -14.824 -1.026 13.620 138.424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.875791 1.077625 7.308 3.05e-13 ***
## total.reviews 0.537047 0.007818 68.695 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.71 on 5998 degrees of freedom
## Multiple R-squared: 0.4403, Adjusted R-squared: 0.4402
## F-statistic: 4719 on 1 and 5998 DF, p-value: < 2.2e-16
# 95% CI for total reviews
cbind(coefficient=coef(lm_sales_tot_rev), confint(lm_sales_tot_rev))
## coefficient 2.5 % 97.5 %
## (Intercept) 7.8757911 5.7632580 9.9883242
## total.reviews 0.5370474 0.5217215 0.5523733
Multiple regression - checking the combined effect of average review scores and total reviews on daily sales:
# lm model to predict the daily sales as a function of average and total reviews
lm_sales_avg_tot_rev <- lm(daily.sales ~ avg.review + total.reviews, data = sales_rev_df)
summary(lm_sales_avg_tot_rev)
##
## Call:
## lm(formula = daily.sales ~ avg.review + total.reviews, data = sales_rev_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.396 -14.645 -1.059 13.690 122.428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.872183 2.341239 10.196 < 2e-16 ***
## avg.review -3.943920 0.513113 -7.686 1.76e-14 ***
## total.reviews 0.543329 0.007823 69.452 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.6 on 5997 degrees of freedom
## Multiple R-squared: 0.4458, Adjusted R-squared: 0.4456
## F-statistic: 2412 on 2 and 5997 DF, p-value: < 2.2e-16
The results above show that daily sales decrease by a statistically significant 3.94 sales for every one-point increase in average review, holding total reviews constant, t(5997) = -7.7, p < 0.0001. Likewise, controlling for average review, daily sales increase by a statistically significant 0.54 sales for every additional review, t(5997) = 69.5, p < 0.0001.
# Generating the confidence intervals for the estimations approach
cbind(coefficient=coef(lm_sales_avg_tot_rev), confint(lm_sales_avg_tot_rev))
## coefficient 2.5 % 97.5 %
## (Intercept) 23.872183 19.2825123 28.4618540
## avg.review -3.943920 -4.9498061 -2.9380331
## total.reviews 0.543329 0.5279928 0.5586651
The 95% CIs above for avg.review and total.reviews are narrow and do not include 0, so both coefficients are significantly different from zero.
Visualizing the surface plots for the main effect of avg.reviews and total.reviews on daily.sales:
sales_preds <- as_tibble(expand.grid(avg.review = seq(0, 5, 0.5),
                                     total.reviews = seq(0, 250, 5)))
sales_preds <- mutate(sales_preds,
sales_hat = predict(lm_sales_avg_tot_rev, sales_preds))
ggplot(sales_preds, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = sales_hat)) + guides(fill=guide_legend(title="Daily Sales")) + labs(x = "Average Review", y = "Total Reviews", title = "Variation in Daily Sales wrt. Total and Average Reviews")
Checking whether there is an interaction between total reviews and average reviews by running an lm() model that includes the interaction term in addition to the main effects.
sales_intr <- lm(daily.sales ~ avg.review*total.reviews, data = sales_rev_df)
summary(sales_intr)
##
## Call:
## lm(formula = daily.sales ~ avg.review * total.reviews, data = sales_rev_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.08 -14.63 -0.92 13.82 92.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.547788 4.177990 15.210 < 2e-16 ***
## avg.review -13.683943 0.993145 -13.778 < 2e-16 ***
## total.reviews 0.164761 0.034067 4.836 1.36e-06 ***
## avg.review:total.reviews 0.091687 0.008035 11.411 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.36 on 5996 degrees of freedom
## Multiple R-squared: 0.4576, Adjusted R-squared: 0.4573
## F-statistic: 1686 on 3 and 5996 DF, p-value: < 2.2e-16
As seen from the above results, there is a statistically significant positive interaction between average reviews and total reviews when predicting the daily sales.
Checking for multicollinearity between average reviews and total reviews before interpreting the models:
vif(lm_sales_avg_tot_rev)
## avg.review total.reviews
## 1.011033 1.011033
Since the VIF scores are well below 5, it is valid to include both average and total reviews as predictors of daily sales.
Next, we compare the two models via anova() to check whether the additional complexity of the interaction between total reviews and average reviews is justified:
anova(lm_sales_avg_tot_rev, sales_intr)
## Analysis of Variance Table
##
## Model 1: daily.sales ~ avg.review + total.reviews
## Model 2: daily.sales ~ avg.review * total.reviews
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5997 3064016
## 2 5996 2998894 1 65122 130.21 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model comparison confirms that the interaction model fits significantly better, F(1, 5996) = 130.21, p < 0.0001. Visualising the main-effect and interaction models graphically:
intr_surf_data <- as_tibble(expand.grid(avg.review = seq(0, 5, 0.5),
                                        total.reviews = seq(0, 250, 5)))
intr_surf_data <- mutate(intr_surf_data,
main.hat = predict(lm_sales_avg_tot_rev, intr_surf_data),
intr.hat = predict(sales_intr, intr_surf_data))
surf.main <- ggplot(intr_surf_data, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = main.hat)) + guides(fill=guide_legend(title="Daily Sales"))+ labs(x = "Average Review", y = "Total Reviews", title = "Variation in Daily Sales wrt. Total and Average Reviews",subtitle = "Main Effects")
surf.intr <- ggplot(intr_surf_data, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = intr.hat)) + guides(fill=guide_legend(title="Daily Sales"))+ labs(x = "Average Review", y = "Total Reviews", title = "Variation in Daily Sales wrt. Total and Average Reviews", subtitle = "Interaction Effects")
sales_rev_main_int_plot <- grid.arrange(surf.main, surf.intr, nrow = 2)
Creating the data frame by selecting sale.price, daily.sales and genre
sale_price_df <- sales_data %>%
select(sale.price,daily.sales,genre)
Checking the distribution for sales vs sale price across genre
sp_sales_all <- ggplot(sale_price_df, aes(y=daily.sales, x=sale.price)) + geom_point(alpha = 0.15) + labs(x="Sales Price", y="Daily Sales", subtitle="Effect of Sales Price on Daily Sales for all Genre") + geom_smooth(method=lm)
sp_sales_genre <- ggplot(sale_price_df, aes(y=daily.sales, x=sale.price, color = genre)) + geom_point(alpha = 0.15) + labs(x="Sales Price", y="Daily Sales", subtitle="Effect of Sales Price on Daily Sales by Genre") + geom_smooth(method=lm)
sale_price_dist <- grid.arrange(sp_sales_all, sp_sales_genre, ncol=1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
As seen from the figure above and confirmed by the correlation matrix below, sale price is negatively correlated with daily sales.
cor(select(sale_price_df,daily.sales,sale.price))
## daily.sales sale.price
## daily.sales 1.0000000 -0.2776016
## sale.price -0.2776016 1.0000000
Utilising the lm() function to determine the effect of sale price on daily sales:
lm_sales_sp <- lm(daily.sales~sale.price, data = sale_price_df)
summary(lm_sales_sp)
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = sale_price_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.769 -20.650 -4.634 17.099 130.315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112.0818 1.5207 73.70 <2e-16 ***
## sale.price -3.8156 0.1705 -22.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.17 on 5998 degrees of freedom
## Multiple R-squared: 0.07706, Adjusted R-squared: 0.07691
## F-statistic: 500.8 on 1 and 5998 DF, p-value: < 2.2e-16
# Confidence Intervals:
cbind(coefficient=coef(lm_sales_sp), confint(lm_sales_sp))
## coefficient 2.5 % 97.5 %
## (Intercept) 112.081751 109.100621 115.062881
## sale.price -3.815582 -4.149821 -3.481342
# emmeans:
(lm_sales_sp_emm <- emmeans(lm_sales_sp, ~sale.price))
## sale.price emmean SE df lower.CL upper.CL
## 8.64 79.1 0.377 5998 78.4 79.8
##
## Confidence level used: 0.95
The results above show a significant relationship between daily sales and sale price: daily sales fall by an average of 3.82 sales for every unit increase in sale price (95% CI [-4.15, -3.48]). This decrease is significantly different from zero, t(5998) = -22.38, p < 0.0001.
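By default, emmeans() evaluates a continuous predictor only at its mean, which is why the output above has a single row. The `at` argument gives predicted sales at chosen price points instead. A sketch with simulated toy data standing in for sale_price_df (the price points 5, 10 and 15 are illustrative):

```r
library(emmeans)

# Toy data mimicking the fitted relationship (slope near -3.8)
set.seed(1)
toy <- data.frame(sale.price = runif(200, 2, 15))
toy$daily.sales <- 112 - 3.8 * toy$sale.price + rnorm(200, sd = 5)

m <- lm(daily.sales ~ sale.price, data = toy)

# Predicted mean daily sales, with 95% CIs, at three chosen price points
emmeans(m, ~ sale.price, at = list(sale.price = c(5, 10, 15)))
```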
Determining this effect across genres by adding genre to the predictors:
lm_sales_sp_genre <- lm(daily.sales~sale.price + genre, data = sale_price_df)
( lm_sales_sp_genre_emm <- emmeans(lm_sales_sp_genre, ~genre) )
## genre emmean SE df lower.CL upper.CL
## childrens 56.7 0.532 5996 55.7 57.7
## fiction 105.4 0.504 5996 104.4 106.4
## non_fiction 75.3 0.507 5996 74.3 76.3
##
## Confidence level used: 0.95
Using anova() to compare the two models above, i.e. without and with controlling for genre:
anova(lm_sales_sp, lm_sales_sp_genre)
## Analysis of Variance Table
##
## Model 1: daily.sales ~ sale.price
## Model 2: daily.sales ~ sale.price + genre
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5998 5102529
## 2 5996 2949564 2 2152966 2188.3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Therefore, the model with sale price and genre as predictors is significantly better than the model with sale price alone.
Checking whether daily sales differ across genres after controlling for sale price:
lm_daily_sales_sp_genre_pairs <- pairs(lm_sales_sp_genre_emm)
lm_daily_sales_sp_genre_pairs
## contrast estimate SE df t.ratio p.value
## childrens - fiction -48.7 0.756 5996 -64.360 <.0001
## childrens - non_fiction -18.6 0.762 5996 -24.344 <.0001
## fiction - non_fiction 30.1 0.702 5996 42.922 <.0001
##
## P value adjustment: tukey method for comparing a family of 3 estimates
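Requirement 2.1.3 also asks whether the price effect itself differs across genres, which the additive model above cannot answer on its own: that requires a sale.price:genre interaction. A sketch of the comparison on simulated toy data (the real analysis would refit both models on sale_price_df; the slope values below are invented for illustration):

```r
# Simulated toy data in which the price slope genuinely differs by genre
set.seed(42)
toy <- data.frame(
  genre      = factor(rep(c("childrens", "fiction", "non_fiction"), each = 100)),
  sale.price = runif(300, 2, 15)
)
slopes <- c(childrens = -2, fiction = -5, non_fiction = -3.5)
toy$daily.sales <- 110 + slopes[as.character(toy$genre)] * toy$sale.price +
  rnorm(300, sd = 5)

m_add <- lm(daily.sales ~ sale.price + genre, data = toy)  # one common slope
m_int <- lm(daily.sales ~ sale.price * genre, data = toy)  # genre-specific slopes

# A significant F here indicates the price effect differs across genres
anova(m_add, m_int)
```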
This section of the report aims to provide the conclusions of the technical development and answer the requirements of the managers of the publishing company.
The table below (Table 2.2.1) shows high-level summary statistics of the book sales data provided to us. The data are evenly distributed in terms of the number of records per genre, and the close proximity of mean and median indicates a fairly symmetric distribution within each genre.
| Genre | Mean | Median | Records |
|---|---|---|---|
| childrens | 55.57726 | 55.390 | 2000 |
| fiction | 105.88591 | 106.005 | 2000 |
| non_fiction | 75.86584 | 75.920 | 2000 |
As the first step of the analysis, the figure below visualises daily sales across genres. As evident from the box plot, one data entry in the non_fiction genre has a negative daily sales value; it makes more business sense to set this value to 0 for further analysis.
To answer this question, we first carried out the anova test and the results are summarised below:
Daily sales differ significantly across genres, F(2, 5997) = 2591, p < 0.0001. The summary() output (NHST approach) likewise shows significant differences, taking the childrens genre as the reference level.
To estimate the average sales for each genre individually, we used the estimated marginal means approach, which yields a 95% confidence interval alongside each genre mean, as shown below:
## genre emmean SE df lower.CL upper.CL
## childrens 55.6 0.497 5997 54.6 56.6
## fiction 105.9 0.497 5997 104.9 106.9
## non_fiction 75.9 0.497 5997 74.9 76.8
##
## Confidence level used: 0.95
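A base-R analogue of the emmeans output above is a cell-means regression, where each coefficient is a group mean and confint() supplies the intervals. The sketch below runs on simulated data (genre means from the report's table are an assumption):

```r
# Simulated data; one coefficient per genre via cell-means coding (~ genre - 1).
set.seed(1)
genre <- factor(rep(c("childrens", "fiction", "non_fiction"), each = 100))
daily.sales <- c(55, 105, 75)[as.integer(genre)] + rnorm(300, sd = 11)

cell <- lm(daily.sales ~ genre - 1)      # coefficients = per-genre means
cbind(estimate = coef(cell), confint(cell))  # means with 95% CIs
```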
The left panel of the figure below shows the mean daily sales for each genre: fiction has the highest sales, followed by non_fiction, with childrens lowest. The right panel shows the estimated difference in daily sales for each pair of genres. For instance, the fiction versus non_fiction comparison shows a point estimate of £30 higher daily sales for fiction, 95% CI [28.4, 31.7].
The plot below illustrates the two competing hypotheses for this requirement. Based on our analysis, the observed distribution matches the alternative-hypothesis plot, so we can reject the null hypothesis.
To answer this question, we began by examining the data distribution. As seen in the figure below, daily sales and total reviews are positively correlated (r = 0.66, p < .05), while daily sales and average reviews are essentially uncorrelated (r = -0.004, p > .05).
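The two correlation tests can be sketched with cor.test(). The data below are simulated to roughly reproduce the reported pattern (a strong positive link with total reviews, none with average review); the variable names are assumptions mirroring the report:

```r
# Simulated data: daily sales driven by total reviews, independent of rating.
set.seed(1)
total.reviews <- rpois(500, 100)
avg.review <- runif(500, 1, 5)
daily.sales <- 20 + 0.5 * total.reviews + rnorm(500, sd = 6)

cor.test(daily.sales, total.reviews)  # strong positive correlation expected
cor.test(daily.sales, avg.review)     # correlation near zero expected
```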
Effect of Total Reviews on Daily Sales:
Our analysis shows a significant relationship between daily sales and total reviews: daily sales increase by £0.53 on average for each additional review (95% CI [0.52, 0.55]). This slope is significantly different from zero, t(5998) = 68.7, p < .0001.
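This simple regression can be sketched as below, on simulated data with a built-in slope of 0.53 (an assumption taken from the reported estimate, so the fitted values are illustrative only):

```r
# Simulated data with a true slope of 0.53 per additional review (assumption).
set.seed(1)
total.reviews <- rpois(500, 100)
daily.sales <- 20 + 0.53 * total.reviews + rnorm(500, sd = 15)

m <- lm(daily.sales ~ total.reviews)
coef(summary(m))["total.reviews", ]  # slope, SE, t-value, p-value
confint(m, "total.reviews")          # 95% CI for the slope
```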
Effect of Average Reviews on Daily Sales:
Our analysis shows that the relationship between daily sales and average reviews is not significantly different from zero. Daily sales decrease by £0.22 on average for every one-point increase in average review, but the confidence interval includes zero (95% CI [-1.6, 1.1]) and the slope is not significant, t(5998) = -0.32, p = 0.747.
Effect of Total Reviews and Average Reviews on Daily Sales:
Holding total reviews constant, daily sales decrease by a statistically significant £3.94 for every one-point increase in average review, t(5997) = -7.7, p < .0001. Conversely, controlling for average review, daily sales increase by a statistically significant £0.54 for each additional review, t(5997) = 69.5, p < .0001.
Our analysis also shows a statistically significant positive interaction between average reviews and total reviews when predicting daily sales. We used an ANOVA model comparison to check whether the added complexity of the interaction term is justified.
Model comparison shows that a regression model including total reviews, average reviews, and their interaction fits significantly better than the model with only the two main effects, F(1, 5996) = 130.21, p < .0001.
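The additive-versus-interaction comparison can be sketched as below. The data are simulated with a genuine positive interaction (the variable names and the effect sizes are assumptions for illustration):

```r
# Simulated data with a positive total.reviews:avg.review interaction.
set.seed(1)
total.reviews <- rpois(600, 100)
avg.review <- runif(600, 1, 5)
daily.sales <- 10 + 0.5 * total.reviews +
  0.1 * total.reviews * avg.review + rnorm(600, sd = 15)

m_add <- lm(daily.sales ~ total.reviews + avg.review)  # main effects only
m_int <- lm(daily.sales ~ total.reviews * avg.review)  # adds the interaction
summary(m_int)$coefficients  # interaction coefficient and its t-test
anova(m_add, m_int)          # does the interaction improve the fit?
```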
The visualisation below summarises the joint effect of average and total reviews on daily sales: the higher a book's average review, the more steeply daily sales rise with the total number of reviews. For instance, at an average review of 4.5, increasing the total reviews spans the entire range of sales values (from the dark-blue surface to the yellow surface).
As seen in the distribution of daily sales versus sale price (£) across genres below, sale price is negatively correlated with daily sales.
Our analysis shows a significant relationship between daily sales and sale price: daily sales decrease by roughly £3.8 on average for every £1 increase in sale price (95% CI [-4.15, -3.48]). This decrease is significantly different from zero, t(5998) = -22.38, p < .0001.
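This regression can be sketched as below, on simulated data with a built-in negative slope (an assumption matching the direction of the reported effect, so only the sign is meaningful):

```r
# Simulated data: daily sales fall as sale price rises (assumed slope -3.8).
set.seed(1)
sale.price <- runif(500, 5, 15)
daily.sales <- 120 - 3.8 * sale.price + rnorm(500, sd = 12)

m <- lm(daily.sales ~ sale.price)
coef(summary(m))["sale.price", ]  # slope, SE, t-value, p-value
confint(m, "sale.price")          # 95% CI, expected to be entirely negative
```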
We used an ANOVA to compare two candidate models, without and with controlling for genre. The model with sale price and genre as predictors fits significantly better than the model with sale price alone. Based on the pairwise comparisons shown below, we conclude that daily sales differ across the genres.
## contrast estimate SE df t.ratio p.value
## childrens - fiction -48.7 0.756 5996 -64.360 <.0001
## childrens - non_fiction -18.6 0.762 5996 -24.344 <.0001
## fiction - non_fiction 30.1 0.702 5996 42.922 <.0001
##
## P value adjustment: tukey method for comparing a family of 3 estimates